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ABSTRACT 

High-dimensional '-omics' profiling provides a detailed molecular view 
of individual cancers; however, understanding the mechanisms by 
which tumors evade cellular defenses requires deep knowledge of 
the underlying cellular pathways within each cancer sample. We ex- 
tended the PARADIGM algorithm (Vaske et aL, 2010, Bioinformatics, 
26, 1237-1245), a pathway analysis method for combining multiple 
'-omics' data types, to learn the strength and direction of 9139 gene 
and protein interactions curated from the literature. Using genomic 
and mRNA expression data from 1936 samples in The Cancer 
Genome Atlas (TCGA) cohort, we learned interactions that provided 
support for and relative strength of 7138 (78%) of the curated links. 
Gene set enrichment found that genes involved in the strongest inter- 
actions were significantly enriched for transcriptional regulation, apop- 
tosis, cell cycle regulation and response to tumor cells. Within the 
TCGA breast cancer cohort, we assessed different interaction 
strengths between breast cancer subtypes, and found interactions 
associated with the MYC pathway and the ER alpha network to be 
among the most differential between basal and luminal A subtypes. 
PARADIGM with the Naive Bayesian assumption produced gene 
activity predictions that, when clustered, found groups of patients 
with better separation in survival than both the original version of 
PARADIGM and a version without the assumption. We found that 
this Naive Bayes assumption was valid for the vast majority of 
co-regulators, indicating that most co-regulators act independently 
on their shared target. 

Availability: http://paradigm.five3genomics.com 
Contact: charlie@five3genomics.com 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 

1 INTRODUCTION 

High-throughput genomic technologies have created increasingly 
larger sets of data capturing the molecular status of cells, and 
these advances have had great impact in the identification and 
understanding of the cellular mechanisms altered in cancer. 
Identification of key targets commonly altered within specific 
tumors has enabled the creation of >40 targeted therapies over 
the past 20 years; however, the response rate of many of these 
drugs is still <50%, which highlights our incomplete understand- 
ing of the pathways around these drugs (Park et aL, 2008). An 
example of such a resistance mechanism is activation of the RAS 
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pathway in EGFR altered colon cancer tumors, where mutated 
KRAS constitutively activates the RAS cascade offering growth 
signals independently of the EGFR pathway, rendering EGFR 
blocking therapies such as cetuximab ineffective (Karapetis et aL, 
2008). By obtaining a more complete understanding of the key 
routes through which oncogenic signals travel within the cellular 
signaling networks, it should be possible to predict new drug- 
gable targets and identify escape routes through which tumors 
can evade existing treatments. 

Approaches for integrating -omics data at the level of path- 
ways have been increasingly popular in the last few years, with 
algorithms such as GSEA (Subramanian et aL, 2005), SPIA 
(Tarca et aL, 2009) and Pathologist (Efroni et aL, 2007) all cap- 
able of successfully identifying altered pathways of interest given 
pathways curated from literature (Varadan et aL, 2012). Another 
approach has constructed causal graphs from curated inter- 
actions in literature and used these graphs to explain expression 
profiles (Chindelevitch et aL, 2012). Algorithms such as 
ARACNE (Margohn et aL, 2006), MINDy (Bansal and 
Cahfano, 2012) and CONEXIC (Akavia et aL, 2010) take in 
gene transcriptional information (and copy-number, in the case 
of CONEXIC) to identify likely transcriptional drivers across a 
set of cancer samples; however, these approaches do not attempt 
to group drivers into functional networks identifying singular 
targets of interest (Eifert and Powers, 2012). Some newer path- 
way algorithms such as NetBox (Cerami et aL, 2010) and Mutual 
Exclusivity Modules in Cancer (MEMo) (Ciriello et aL, 2012) 
attempt to solve the problem of data integration in cancer to 
identify networks across multiple data types that are key to the 
oncogenic potential of samples. GIENA (Liu et aL, 2012) looks 
for dysregulated gene interactions within a single biological path- 
way but does not take in to account the topology of the pathway 
or prior knowledge about the direction or nature of the inter- 
actions (Faith et aL, 2007). Probabihstic graphical models have 
been used extensively in network analysis with landmark uses in 
the form of Bayesian Networks (Segal et aL, 2003) and Markov 
Random Fields (Letovsky and Kasif, 2003). Several methods 
have successfully learned interactions from data through many 
different means, including relevance networks (Faith et aL, 2007). 
Our algorithm, PARADIGM (Pathway Recognition Algorithm 
using Data Integration on Genomic Models) (Vaske et aL, 2010), 
uses a probabilistic graphical model to integrate multiple gen- 
omic data types on curated pathway databases and is unique for 
its per- sample approach that allows individual samples to be 
assessed alone or within the context of a cohort of interest. 
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Learning sensible parameters for gene interactions is essential 
for PARADIGM to infer activities within each sample. In the 
original implementation of PARADIGM, expectation-maxi- 
mization (EM) parameter learning was only performed by de- 
fault on observational data parameters, as the limited size of 
available datasets prevented robust estimation of interaction par- 
ameters. When using a standard conditional probability table 
with discrete variables, the size of the table, and therefore the 
number of parameters, grows exponentially with the number of 
regulators, presenting additional challenges to efficient estima- 
tion of interaction parameters. However, by assuming a condi- 
tional independence of the regulators, we can replace the 
conditional probability table with a product of independent regu- 
latory factors, and the number of parameters grows linearly with 
the number of regulators. 

Examination of statistics related to regulatory links, rather than 
individual gene activities, is a view related (but not directly corre- 
lated) to that which is normally examined in high-throughput 
studies. By identifying regulatory links that have significantly dif- 
ferent usage distributions within a phenotype of interest in the co- 
hort, we can begin to examine how different regulators within a 
network might produce similar cellular phenotypes despite using 
entirely different pathways to accomplish them. Additionally, 
these learned parameters can then be used as the basis for statis- 
tical tests to estabhsh how well individual samples or subsets of the 
cohort follow the distribution of learned parameter patterns for 
each regulatory node. This article describes in detail this new ap- 
proach and shows the overall improvement and additional ana- 
lysis capabilities when applied across the TCGA data. 



2 METHODS 

PARADIGM represents the states of biological molecules — e.g. proteins, 
mRNAs, complexes and small biomolecules — from a tumor sample as 
variables in a probabilistic graphical model. For every gene, we use vari- 
ables for the genome copy number, mRNA and protein, and additionally 
a non-physical variable that corresponds to biological activity of a gene, 
as annotated in a pathway, and which may be regulated by posttransla- 
tional modification of the protein. Additionally, there are variables that 
represent more abstract states, such as apoptosis, that are commonly 
annotated in pathways. 

Causal interactions that change the state of molecules — e.g. gene tran- 
scriptional regulation, protein phosphorylation and complex formation — 
are represented as directed edges from the regulating variable to the 
regulated variable. For each variable Y in the probabilistic graph, we 
introduce a factor into the joint probability model relating the state of 
the variable to the state of all its regulators: F{Y\Xi,X2, . . . , X^), where 
Xi through are the variables that regulate Y. This factor is a 
conditional probability table: for each setting of Parents(Y), 
J]^,^j^F(T = j|P<3r£'/i/^(y)) = 1. Observations of individual variables, 
such as the genome copy number or gene expression, are modeled as 
separate variables, connected to the latent variable by a factor F{Y\X), 
also a conditional probability table. The full joint probability state 
is then: 



P{Q) = -W F(T|Parents(y)) 



(1) 



where Z is a normalization constant required due to regulatory cycles in 
the pathway. 

Given observations for a sample, we solve for marginal distribution 
of each unobserved variable, using the loopy belief propagation 



implementation in libDAI (Mooij, 2010) with inference performed in 
the probability space (as opposed to log space), a convergence tolerance 
of 10~^ and with the SEQFIX update schedule. The parameters for all F 
functions are learned via expectation maximization in libDAI, stopping 
when the ratio of successive log- likelihoods is <10~^^. 

In this work, we have introduced new variables into each gene's central 
dogma that correspond to the transcriptional, translational and protein 
regulation states of each gene, as shown in Figure 1 A. This central dogma 
means that each protein-coding gene will have identical central dogma 
structure, and therefore we are able to share parameters between all 
genes. The unique regulatory program is then modeled only in the tran- 
scription, translation and protein regulation variables for each gene. 

Regulation models: We extended PARADIGM by altering how regu- 
lation nodes are handled by the algorithm. To construct a factor graph 
and allow for comparison between many types of data, PARADIGM 
discretizes the input data to down, up or normal relative to some control. 
Regulation nodes collect activity signals of all of the genes involved in 
regulation of a given gene at some point along the path from DNA to 
active protein. These signals are collected in a single variable which con- 
nects to a gene's central dogma structure through a factor. Under the 
original model, regulation nodes would simply take a vote of incoming 
signals to decide if an activation or inhibition signal was passed along. In 
this new version, we learn the likelihood of each setting of the child 
variable Y being passed given the setting of the parent nodes 
Xi, . . . , Xj^. In this article, we contrast both the co-dependent and inde- 
pendent regulation models shown in Figure IB. With the co-dependent 
regulation model, this probability is stored directly as a parameter in a 
conditional probability table for all possible settings of the parents and 
child. In contrast, with the independent regulation model, we use P{Y) 
and P{Xi\ Y) as parameters and simply calculate the product of the par- 
ameters to find this probability: 



F{Y\Xu ■■■,Xn) = -P{Y)Y[P(X,\Y) 



(2) 



where Z is a normalizing constant that corresponds to P{X\ , . . . , Xjq). To 
initialize the parameters for the independent regulation model, we give 
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Fig. 1. Factor graph structures in PARADIGM. (A) Central dogma 
structure shared by all protein coding genes. (B) Alternative regulation 
models for the transcription, translation and activation nodes. In the Co- 
dependent Regulation Model, we learn a full conditional probability table 
of the child given the parents, while in the Independent Regulation 
Model, we learn conditional probabilities of individual links and use a 
Naive Bayes assumption to calculate the probability of the child node 
given the parents 
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P(T) an equal probability of down, up or normal, and we set the initial 
probability for P(Xi\Y) based on the annotation of the link in the 
pathway. For links labeled as activators P(down\down) = 
P(normal\normal) = P{up\up) = 0.8 and for inhibitors P(down\up) = 
P(normal\norma[) = P(up\down) = 0.8 with all the probabilities of all 
other settings set to 0.1. We also performed tests using a uniform distri- 
bution across all settings to test the importance of using this prior know- 
ledge from the pathway. 

We use the same simple voting procedure that was originally used 
in PARADIGM as the initial parameters for EM learning in the 
co-dependent regulation model. We use e = 0.001 so 99.9% of the prob- 
ability is placed in the child state that wins the vote and 0.05% is placed in 
the other states as the initial likelihoods. 

An additional minor change to the original PARADIGM algorithm is 
that we now allow 'activation' regulation of complexes and gene families 
between the protein and active states. Specifically, each family and com- 
plex is now modeled by a trio of variables: family/complex, regulation 
and active, connected with a single factor F(active\regulation, 
family /complex). Regulators of the family or complex are connected to 
the active variable, with either the co-dependent or independent regula- 
tion model. Components of the family or complex are connected to the 
family/complex variable, using either a noisy-min or noisy-max factor, 
with € = 0.001. Only the noisy-min or noisy-max factor was used in ear- 
lier iterations of PARADIGM. 

Regulation statistics: We use G-tests to determine the significance of 
the dependence between parents and children of regulatory links (3) as 
well as the significance of the conditional dependence between parents 
given a child distribution (4): 



-2N^P{Xi,Yj)\n 



P{Xu Yj) 
P(XdP(Yj) 



(3) 



(4) 



The G-test follows the distribution, so we can find P- values using 
distributions with 4 and 12 degrees of freedom for the parent-child test 
and the parent-parent test, respectively. P-values are adjusted for false 
discovery rate (FDR) and links with adjusted P<0.05 are considered 
significant. 

Although the G-test (which is proportional to the mutual information) 
tells us how strong an interaction is, it doesn't give us details about the 
sign of the interaction (i.e. activation is a positive interaction and inhib- 
ition is a negative interaction). To get these details, we calculated both the 
Pearson correlation between the parent and child, and the weighted 
pointwise mutual information, or WPMI (5) (Raina et al., 2006) at all 
possible settings of parent and child. Correlation was calculated using the 
joint distribution P(Xi, Y) = P(Xi\Y)P(Y), and significance was calcu- 
lated using the Fisher transformation. Correlation between two parents 
given the child was also calculated to determine if the three nodes formed 
a coherent or incoherent feed forward loop. 

To compare G-test results between groups, we took the differences of 
the ranks of the G statistic in each group. The significance of this statistic 
was calculated by performing a permutation test with 5000 random 
permutations of the group membership and then adjusting for FDR. 
For differences greater than any of those observed in the permutations, 
the lowest possible P- value was used as an upper bound. 



WPMIij = P(Xi, Yj)\n 



PiXdPiYj) 



(5) 



The WPMI is simply each individual element of the G-score sum. We 
arrange the vector of 9 WPMI values as an easy to interpret heat map. 

Ovarian clustering: We used the HOPACH clustering algorithm 
from Bioconductor (van der Laan and Pollard, 2003; Pollard et al., 
2012), which attempts to find the number of clusters that best fits the 
data. This results in different numbers of clusters for each set of IPLs 
clustered, so to find clusterings with a consistent number of clusters be- 
tween all datasets, we collapsed the smallest clusters by reassigning small 
cluster members to the closest large cluster. We collapsed small clusters in 
this manner to get a consistent number of clusters across all of our 
clusterings. This method also served to keep cluster sizes consistent 
across our comparisons. 

2.1 Genomic and pathway data 

Genomic and pathway data (Matthews et al., 2009; Schaefer et al., 2009) 
detailed in Supplement 1 . 

Set enrichment: We used DAVID (Huang et al, 2008; Sherman et al, 
2009) to perform gene set enrichment on the genes involved in inter- 
actions learned by PARADIGM. To maximize number of genes recog- 
nized by DAVID, we split gene complexes and families into their 
component genes. Enrichment for genes involved in links was compared 
to a background of all of the genes in our curated pathway. 

Intermediate nodes: A full conditional probability table with N par- 
ents will store probabilities for all 3^+^ possible settings of parents and 
children. Some central genes in our curated pathway have >30 regulators, 
so to prevent the size of these tables from becoming prohibitive, we 
limited the number of parent nodes that could be attached to a child 
node to 5. For genes regulated by more than five proteins, we added 
intermediate nodes to the graph to maintain this limit. E.g. a gene with 
10 regulators would have two intermediate nodes with five regulators 
attached to each intermediate node. 

Coxnet feature selection: The TCGA cohort was first subset down 
to the 364 breast cancer samples labeled as ER+ in the associated clinical 
data. Survival censoring was determined by the Vital Status label and if a 
patient was not alive the Days to Death data was used, otherwise Days to 
last known alive (if present) or Days to last follow-up were used. Both 
IPLs and sample-specific link g- scores were filtered to the top 5% of 
features by variance, and features were then z-score transformed to 
normalize variance between the two types of values. The selection was 
performed by using the cox method in the glmnet package (Friedman 
et al., 2010; Simon et al., 2011) version 1.9-1 in R version 2.15.2 with a 
maximum iteration of 100000 which completed without warnings. 



3 RESULTS 

We learned regulatory interactions on a dataset of 1936 TCGA 
tumor samples with gene expression and copy number data, from 
11 tissue types. We then assessed interaction significance by a 
G-test and interaction sign with a correlation value. Of the 9139 
interactions in the pathway model that regulate a protein, 763 1 
(83.5%) were found to be significant at an FDR of 0.05. A prin- 
cipal component analysis (PCA) of the WPMI vectors for each 
interaction learned across the entire TCGA cohort (Fig. 2) 
reveals a gradient from strong inhibition to strong activation. 
K-means clustering of the WPMI vectors found clusters along 
this gradient representing canonical interaction types ranging 
from strong activation to strong inhibition. Of 763 1 significant 
links, 78 (1%) were placed in a cluster where the centroid was 
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Fig. 2. (A) Principal component analysis of regulatory links in the TCGA cohort. Each point is the projection of the 9 WPMI scores for a link onto the 
top two principal components. The convex hulls show the membership of k-means clustering performed on the (unprojected) WPMI scores, and the 
cluster numbers are placed at the centroid of each cluster. (B) Cluster membership of significant links labeled as activation and inhibition in the pathway. 
(C) Heatmaps of the WPMI values of the centroids of the clusters show a range from strong inhibition (1) to strong activation (5) 



going the opposite direction of how the link was annotated in the 
pathway. The variety of WPMI vectors shows that EM was able 
to learn new interaction regimes that resemble activators and 
inhibitors as well more complex regulatory patterns. 

Using correlation measure (see Section 2), we assessed each 
interaction as activation or inhibition and compared with the 
interaction type annotated in the pathway model. There were 
7357 links with both significant correlation and g-scores and of 
those the correlation of 219 links (3%) did not agree with the 
direction of regulation in the pathway. This leaves 7138 (78%) 
links that are significant by both tests and agree with the curated 
links. We also found that that some links had high correlation 
values but low significance from our g- tests; this usually hap- 
pened in cases where either the parent or child distribution 
greatly favored a single state. 

We compared these results to what could be found by a 
straightforward Pearson correlation of gene expression profiles. 
Because we can not look at expression profiles for famihes and 
complexes, we tried two different approaches for this compari- 
son. First, we compared our results to the expression correlation 
of links not involving complexes or famihes. Of the links learned 
by Paradigm, 1197 had significant correlation and g-scores and 
did not include complexes or famihes. For 51 of these links 
(4.3%), the sign of correlation coefficient disagrees with the lit- 
erature. On the other hand, looking only at gene expression pro- 
files, we found 1058 non-complex non-family links with 



significant correlation, but 470 (44%) disagreed with the sign 
of the pathway entry. For our second comparison, we eliminated 
complexes and families in our pathway by connecting all genes 
that were components of famihes and complexes directly to any 
gene regulated by those famihes and complexes. This flattening 
procedure resulted in 200921 links. We found that 165 258 of 
these links had significantly correlated gene expression profiles, 
and that 81 558 of the links (49.4%) had correlation that dis- 
agreed with the direction of the link in the pathway. These results 
indicate that the links learned by paradigm are much more in 
agreement with the direction of the links in literature than the 
correlation of gene expression profiles is. 

Running the PCA and clustering analysis on only WPMI 
scores learned from TCGA Ovarian (OV) patients (7V=416) 
and without complex and family activation regulation produced 
very similar results to the PCA and cluster centers shown in 
Figures 2A and C, but found fewer significant links and a 
higher proportion of links that were annotated as activators, 
and learned as inhibitors or vice versa (Fig. 3 A). When we 
used a flat initialization of P{Xi\Y) = 1/3 (Fig. 3B), we found 
that the cluster centers again mapped to a gradient from activa- 
tion to inhibition, and there were fewer significant links and 
a higher proportion of link direction disagreements than with 
initial settings that include direction information. 

We expected the reduction in significant links between the 
entire TCGA cohort and just OV samples because datasets 
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Interaction type Interaction type 

Informative Initialization Flat Initialization 

Fig. 3. (A) Cluster membership bar plots for WPMI values of significant 
links learned from the ovarian cohort using an informative prior. 
(B) Clustering membership when starting with a flat prior. Cluster cen- 
ters range from strong activation (blue) to strong inhibition (red) as in 
Figure 2C 



with a larger number of samples have more evidence to support 
even subtle interaction trends. The increase in proportion of links 
in clusters that do not agree with their annotation could be ex- 
plained by the difference in sample number because any outliers 
would have less effect on probabihties calculated across the entire 
dataset. In this setting, however, these changes of link signifi- 
cance or direction are of interest because the TCGA cohort con- 
tains many different diseases and tissue types. The pathway does 
not include details about the tissue or disease in which a given 
interaction was observed in the literature, so it is possible that in 
a different tissue or disease state and interaction direction could 
change either through mutation or some other mechanism. These 
direction changes are worth studying because they may give us 
some insight into the mechanism of the disease. Our tests with 
flat initialization show that even without starting the link par- 
ameters in linear activation/inhibition states, a gradient across 
these linear relationships can be learned. Many of the links that 
lose significance come from the middle cluster, which has little 
positive or negative correlation, but we also lose most of the links 
from the clusters with the strongest activation and inhibition 
signals. This suggests that by not using prior knowledge about 
link type, we lost some of our strongest interactions that could 
have been biologically relevant and could also reduce the quality 
of our final protein activity predictions. 

To test the Naive Bayes independence assumption presented in 
Figure 1 , we ran PARADIGM with both the independent and co- 
dependent regulation models on the TCGA ovarian cancer sam- 
ples. We tested the conditional independence assumption on the 
expectations calculated at each EM step of the PARADIGM run 
(Fig. 4A). At every step of training, fewer co-regulators were 
found to be dependent upon each other. Because of small feed- 
back loops in the pathway, such as a transcription factor that 
regulates its own transcription, we expect that the independence 
assumption will fail in some cases. Additionally, it is quite 
common for two very similar complexes, differing by only one 
molecule, to co-regulate the same child node, in which case we 
also expect the conditional independence test to fail, despite there 
being little conflict. Consequently, we divide the cases where two 
co-regulators fail the independence test into 'coherent' and 
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Fig. 4. (A) Percentage of unique child nodes that fail the following tests at 
each EM step of a PARADIGM run learning a full conditional prob- 
ability table: /. a test of the significance of conditional independence of 
any two parents given the child. //. test / and at least one of the parents 
that fails is significantly linked to the child, in. test / and the failing triplet 
is incoherent, iv. tests /, // and (B) Examples of coherent versus inco- 
herent triplets. The arrows correspond to correlation with a pointed head 
for positive correlation (activation) and a flat head for negative correl- 
ation (inhibition). The interactions between parents are not found in the 
literature, so we use double sided arrows because we can not know the 
direction of that interaction 



'incoherent' classes, as shown in Figure 4B. Additionally, two 
co-regulators may fail the independence test even if one of the 
co-regulators is an insignificant regulator, owing to the strength of 
the other regulator. We therefore also consider the subset of cases 
where both co-regulators are significant on their own. Our tests 
show that the initial parameters produced by the weighted vote 
method cause almost 50% of child nodes to fail the conditional 
independence test, but as the EM algorithm learns more likely par- 
ameter settings, fewer and fewer nodes fail the test. Combining all 
of our tests shows that only ~5% of child nodes are likely to have 
codependent regulators in a meaningful way. 

Using the ovarian cancer samples, we clustered the protein 
activity predictions produced by the original PARADIGM and 
those from both the co-dependent and independent regulation 
models. We then performed Kaplan-Meier analysis on these 
clusters to see whether they had significantly differential survival 
profiles (Fig. 5). We found that the clusters produced using in- 
dependent regulation model activity predictions were the most 
separable by their survival (log-rank P = 2.0x 10~^^). We also 
performed this test using the independent regulation model with 
a flat initial setting for the P(Xi\ Y) parameters and found that it 
performed worse than the original PARADIGM model. Again, 
this indicates that our learning method requires prior knowledge 
about the type of interaction that is lost when using a flat initial 
interaction setting. 

Figure 6 shows tissue-differential link usage in the most 
significant by coloring each interaction by its correlation score 
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Fig. 5. Kaplan-Meier survival curves of 416 patients in the TCGA ovarian cohort clustered by Integrated Pathway Activity using (A) the original 
PARADIGM implementation, (B) PARADIGM learning full conditional probability tables of regulatory nodes and (C) PARADIGM learning con- 
ditional probability of single links and using a naive Bayes assumption 
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Fig. 6. Heatmap of the g-score ranks colored by link correlation, with red 
tending towards activating and blue tending towards inhibiting. For visu- 
alization purposes, interactions were filtered if they had a standard devi- 
ation <0.2 across all samples or did not have at least one tissue with a 
score of >0.7, resulting in 211 interactions out of the original 10307 



in a tissue and setting its saturation proportional to its signifi- 
cance. The strongest differential g-scores were seen for links 
regulated by key cancer genes and complexes, including TP53, 
MYC/MAX, HIFIA/ARNT, TAp73a, E2F1 and PPARA- 
RXRA. Of particular interest are the links regulated by 



PPARA-RXRA primarily different within GBM [brain and 
KIRC (kidney)] and the TAp73a regulatory links in OV (ovar- 
ian) and to a lesser degree in UCEC (uterine endometrioid). 
Figure 7 shows a plot of the WPMI signals grouped by tissue 
for the activating links from PPARA-RXRA and TAp73a, 
where significantly increased weights are found on the activating 
diagonal, indicating increased use of these links as activators 
in those tissues. The signature of TAp73 activity potentially 
indicates a female reproductive or hormonal pattern of patho- 
genesis associated with p73 expression. TAp73 promotes the 
expression of cell cycle inhibitors and inducers of apoptosis, 
one of which is the tumor suppressor BAX, which acts as an 
inhibitor of the activity of the oncogene BCL2. BCL2 is known 
to be highly expressed in serous ovarian cancer, and our results 
here show that although TAp73 is highly expressed and is a 
strong promoter of BAX expression (and thus BCL2 inhibition), 
it is nonetheless ineffective in retarding tumor igenesis, suggesting 
that small molecule inhibition of BCL2 may be equally ineffect- 
ive. Not surprisingly, single-agent treatments of ovarian cancer 
with small molecule inhibitors of BCL2, despite high BCL2 ex- 
pression in serous ovarian cancer, have not succeeded to date 
(Simonin et al., 2013), suggesting a downstream blockade or 
attenuation of TAp73-mediated activity in this type of cancer. 
It is important to note that almost all of the serous ovarian 
samples here bore mutations in p53, perhaps suggestive of an 
upstream shunting of tumorigenesis as well that perhaps over- 
comes TAp73 over-expression or increased activity. Other 
groups have additionally shown the importance of PPARA- 
RXRA activity in both GBM and KIRC and their sensitivity 
to fenofibrate, a PPARA agonist (Giordano and Macaluso, 
2012; Ganti et aL, 2012). The tissue- specific signals identified 
through this analysis appear to reiterate recent biological discov- 
eries that appear to be unique when examined in the context of 
the current TCGA dataset. 

The most significant links learned across the entire TCGA 
cohort (Table 1) are a number of known cancer genes including 
the forkhead box transcription factor Al, p53 and estrogen re- 
ceptor alpha. To perform a gene set enrichment with DAVID 
(Huang et aL, 2008; Sherman et aL, 2009) on the genes involved 
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Fig. 7. Boxplots of WPMI values across cancer types (A) WPMI values for links with PPARA:RXRA as a parent node. There is a stronger activation 
signal in GBM and KIRC. (B) WPMI values for links with TAp73a as a parent node, showing activation in OV 



Table 1. Regulatory links with the highest g test score across the entire 
TCGA cohort 



Parent 


Child 


g score 


Direction 


FOXAl 


SFTPA (family):txreg 


3247 


197 


t 


HNFIA 


HNF4A (family):txreg 


3208 


440 


t 


GATAl 


alpha-globin (family) :txreg 


3065 


885 


t 


ONECUTl 


HNFIB (family):txreg 


3008 


945 


t 


p53 tetramer 


MDM2:txreg^ 


2931 


148 


t 


(complex) 










KLF4 


Preproghrelin (family) :txreg 


2914 


620 


t 


PDXl 


NR5A2 (family):txreg 


2872 


275 


t 


p53 tetramer 


SFN:txreg^ 


2811 


958 


t 


(complex) 










ER alpha 


alpha tubulin (family) :txreg 


2781 


369 


t 


homodimer 










(complex) 










FOXMl 


CENPA:txreg 


2739 


028 


t 



P- values for all link are less than le-323. 
^Intermediate node. 



in the 50 interactions with the highest G-scores, we replaced 
families and complexes with their component genes. This pro- 
duced 112 unique genes that were recognized by DAVID from 
the top 50 links. These genes were significantly enriched (P< le- 
7) for a number of relevant KEGG terms including 'pathways in 
cancer', 'apoptosis', 'Jak-STAT signahng pathway' and 'MAPK 
signaling pathway' as well as a number of different cancer type- 
specific terms. We compared this result with what could be found 
by only looking at gene expression correlation of the genes that 
are linked in our pathway. We needed to take the top 200 gene 
expression pairs by Pearson correlation from the flattened path- 
way to get a set of unique genes of comparable size (A^= 119) to 
the set produced by Paradigm. Although both gene sets 



produced similar enrichments for Gene Ontology terms for bio- 
logical processes (GOTERM_BP_FAT), we found far fewer 
KEGG terms by using gene expression correlation than by 
using our learned links (20 versus 46 at FDR < 0.05) and the 
FDR. The KEGG terms that overlapped between the two sets 
had a lower FDR in the PARADIGM set. To ensure that the 
flattening of famihes and complexes in the pathway was not 
biasing these results, we repeated this analysis for non-family, 
non-complex links in the pathway only and found similar results 
(20 KEGG terms found for Paradigm links versus 3 for expres- 
sion correlation at FDR<0.05). 

We compared the strength of links between subtypes of breast 
cancer to get some insight into the regulatory differences between 
the subtypes (Table 2). This comparison as well as other com- 
parisons between tissues never found links that completely 
switched direction from activation to inhibition. Instead, we 
often observed that links turned off or on (e.g. changed from a 
strong activator to neutral). Because direction rarely changes, we 
found it informative to simply look at the differences between the 
G-score significance of links. We used the rank difference of the 
G-scores to allow us to compare between groups so as to adjust 
for the G-score's dependence on sample size. Many of the links 
with the highest rank differences had the same parents, so 
Table 2 shows the links with the highest rank difference on a 
per parent basis and include the full table as Supplement 2. In 9 
of the top 10 links that were stronger in Basal tumors, HIFIA 
was the parent, and the top four links stronger in Luminal A 
tumors had CEBPB as a parent. 

To identify clinically relevant activities and link strengths, we 
examined the estrogen receptor-postiive (ER+) breast cancer pa- 
tients. We performed a regularized Cox regression of TCGA 
survival data on both hnk g-scores and IPLs to identify the op- 
timal number of features to best split the cohort. At the min- 
imum lambda, the coxnet model contained nine features that 
best split the ER + breast cancer patients (Table 3). Four of 
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Table 2. Regulatory links with adjusted P<0.05 in either Basal (A^= 92) or Luminal A (A^= 218) breast cancer tumors, and the highest rank differences 
in G-scores per parent 



Parent 


Child 


P- value Basal 


P-value Luminal 


Rank difference 


Direction 


HIFIA/ARNT (complex) 


HKl 


1.61e-3 


0.834 


7826 


t 


E2F3/DP/TFE3 (complex) 


RRMl 


9.20e-3 


0.854 


7632 


t 


MYB 


PPP3CA 


3.09e-2 


0.493 


5203 


t 


E2F1/DP (complex) 


WASFl 


3.48e-2 


0.459 


4924 


t 


E2F1/DP/PCAF (complex) 


TP73 


6.59e-3 


0.343 


4225 


t 


CEBPB 


HSP90B1 


0.879 


9.65e-3 


6275 


t 


JUN 


AChR (family) 


0.833 


0.0256 


4742 


t 


SPl 


CDKN2C 


0.771 


5.94e-4 


4700 


Not significant 


DNA damage (abstract) 


SERPINB5 


0.808 


0.0300 


4264 


t 


LEFl/beta catenin/PITX2 (complex) 


LEFl 


0.775 


9.18e-3 


4250 


t 



Note: Adjusted P of all rank differences in this table was < 4.8e-4. All edges were annotated as transcriptional activators. Full table is Supplementary Material. 



Table 3. Pathway features (edges and nodes) associated with survival in 
ER + breast cancer patients 



Feature 


Cox hazard coefficient 


GLI2A^GLI1 


0.08484 


HIFIA/ARNT (complex) ^ CP 


0.07835 


MYB ^ CEBPB 


0.00462 


E2F1/DP (complex) ^ SIRTl 


-0.00072 


p300/CBP (complex) 


-0.00204 


SDC3 


-0.04840 


p300/CBP/RELA/p50 (complex) 


-0.11126 


TAp73a (tetramer) (complex) 


-0.11301 


TCF IE/beta catenin (complex) 


-0.16129 



Note: Edges are identified by and all edges found are annotated as transcrip- 
tional activators in the pathway. 



the nine features were link g-scores, which illustrates the inde- 
pendent utility of these scores as potential prognostic markers. 
Additional work is needed to validate this model in an independ- 
ent dataset before it can be considered a true prognostic signa- 
ture in ER + patients. 

CEBPB and HIFIA/ARNT appeared in both Tables 2 and 3. 
CEBPB is a transcription factor that has been associated with 
tumor progression, poor prognosis and ER negative status 
(Milde-Langosch et al., 2003). Furthermore, over expression of 
HSP90B1, a heat shock protein regulated by CEBPB and found 
in Table 2, has been associated with distant metastases and 
decreased overall survival in breast cancer patients with other- 
wise good prognoses (Cawthorn et al., 2012). HSP90B1 has 
undergone clinical trials as an immunotherapy for melanoma 
under the name vitespen (Testori et al., 2008). HIFIA/ARNT 
overexpression is clinically relevant in ER— and PR— breast 
cancer, where splice variants have been associated with reduced 
metastasis-free survival (Dales et al., 2010). Because Basal 
tumors are generally ER— , and Luminal A tumors are generally 
ER+, the differential link strength could be due to increased 
occurrence of the splice variant in the Basal tumors. The top 



two links by G-score rank difference between Basal and 
Luminal are HIFIA/ARNT activating HKl and HK2 (hexoki- 
nases), HK2 is involved in glucose metabohsm and apoptosis, 
and has been associated with brain metastases from breast can- 
cers as well as poor survival post craniotomy (Palmieri et al., 
2009). These findings indicate that we are able to find links 
that are relevant both by contrasting between tumor subtypes 
and by searching for links within a subtype that are predictive of 
a chnical variable. 



4 DISCUSSION 

We have shown that by extending PARADIGM, we can com- 
bine multiple -omics data to learn the strength and sign of regu- 
latory interactions curated from the literature. The assumption 
of conditional independence enables a reduction in model com- 
plexity allows efficient estimation of regulatory parameters using 
existing datasets, and further, and we show that the independ- 
ence assumption is valid for the vast majority of regulatory pro- 
grams. In addition, where the independence assumption does not 
hold, future extensions would be able to replace the independent 
factors with more complex factors that properly model a co-de- 
pendent regulatory program. When these learned parameters are 
applied, biological insight can be gained from simply looking at 
the strongest links across a cohort of samples or by looking at 
how interactions change between phenotypes of interest. This 
regulatory learning improved PARADIGM'S overall protein ac- 
tivity predictions, resulting in better separation of survival across 
clusters of ovarian cancer patients. 

We find that though cancer subtypes use different interactions, 
an interaction generally has a consistent sign whenever it is used 
in a particular tumor. This indicates that our current level of 
knowledge of cofactors as able to account for the cases where 
a gene switches the direction of activity. Further, the concord- 
ance of our learned interaction sign and the interaction sign in 
databases, despite the various ways that interaction sign is anno- 
tated in the BioPAX language across pathway databases, indi- 
cates that pathway databases have successfully and faithfully 
cataloged of thousands of wetlab experiments in the literature. 
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The independence of co-regulators provides computational 
benefits for model inference and parameter learning, and also 
aids in model interpretation. The factor ability of regulation 
models corresponds to log-linearity. However, a great number 
of regulators in the model are complexes, and the complex 
formation factor is a non-linear noisy-MAX function. Thus, 
regulation nonlinearity can still be encoded in the factor graph 
by representing physical complexes. This lends plausibility to a 
physical interpretation of most regulation links in the pathway: 
competitive binding of independent regulators should combine 
linearly, as long as the truly independent physical entities have 
been captured as complexes. If this physical interpretation 
is true, then there should be a correspondence between relative 
strengths of measured physical binding constants and 
PARADIGM interaction scores. In cases where the independ- 
ence assumption does not hold, it is likely that there is a latent 
co-factor, which could be modeled by replacing P(Y\Xi)P(Y\X2) 
with a factor such as P(Y\X\,X2). 
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